Quantitative Methodology (UPF)
“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.
Source: R for Data Science
A df is tidy if it fulfills these requirements (Wickham 2014):
But …
# A tibble: 8,131 × 5
nombre_de_provincia nombre_de_municipio votos_validos pp mas_pais
<chr> <chr> <dbl> <dbl> <dbl>
1 Almería Abla 710 193 0
2 Almería Abrucena 699 111 0
3 Almería Adra 10941 3126 0
4 Almería Albánchez 272 56 0
5 Almería Alboloduy 419 155 0
6 Almería Albox 4335 1261 0
7 Almería Alcolea 509 92 0
8 Almería Alcóntar 371 125 0
9 Almería Alcudia de Monteagud 106 47 0
10 Almería Alhabia 409 95 0
# … with 8,121 more rows
Why do we need to clean data?
Load packages.
Number of TB cases documented by the WHO in Afghanistan, Brazil, and China between 1999 and 2000 (cases & population).
Table A
# A tibble: 6 × 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Table B
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Table C
# A tibble: 12 × 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Table D
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
NOO ANOTHER
Correlates of War Intra-State War Data
# A tibble: 442 × 4
WarName WarType SideA SideB
<chr> <dbl> <chr> <chr>
1 First Caucasus 5 Russia Georgians, Dhagestania, Chechens
2 Sidon-Damascus 6 Sidon Damascus & Aleppo
3 First Two Sicilies 4 Austria -8
4 First Two Sicilies 4 Two Sicilies Liberals
5 Spanish Royalists 4 Spain Royalists
6 Sardinian Revolt 4 Austria -8
7 Sardinian Revolt 4 Sardinia Carbonari
8 Greek Independence 5 Ottoman Empire Greeks
9 Greek Independence 5 -8 United Kingdom
10 Greek Independence 5 -8 France
# … with 432 more rows
PEW Research Religious Landscape Study1
# A tibble: 18 × 6
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76
2 Atheist 12 27 37 52 35
3 Buddhist 27 21 30 34 33
4 Catholic 418 617 732 670 638
5 Don’t know/refused 15 14 15 11 10
6 Evangelical Prot 575 869 1064 982 881
7 Hindu 1 9 7 9 11
8 Historically Black Prot 228 244 236 238 197
9 Jehovah's Witness 20 27 24 24 21
10 Jewish 19 19 25 25 30
11 Mainline Prot 289 495 619 655 651
12 Mormon 29 40 48 51 56
13 Muslim 6 7 9 10 9
14 Orthodox 13 17 23 32 32
15 Other Christian 9 7 11 13 13
16 Other Faiths 20 33 40 46 49
17 Other World Religions 5 2 3 4 2
18 Unaffiliated 217 299 374 365 341
We change the rows and columns of the dataframe keeping the same information.
pivot_longer(df, cols, names_to, names_to)
pivot_wider(df, names_from, values_from)
separate(df, col, into, sep)
unite(df, col, ..., sep)
NA (Not Available)
Related with the state capacity to collect them (Stone2008?)
Check the World Bank
Small or underdeveloped states
is.na()
Quantitative Methodology (UPF)